This lecture is about syntagmatic relation discovery and entropy.
In this lecture, we're going to continue
talking about word association mining.
In particular, we're going to talk about
how to discover syntagmatic relations.
And we're going to start with
the introduction of entropy,
which is the basis for designing some
measures for discovering such relations.
By definition,
syntagmatic relations hold between words
that have correlated co-occurrences.
That means, when we see one word occur in a context, we tend to also see the occurrence of the other word.
So, to take a more specific example here, we can ask the question: whenever eats occurs, what other words also tend to occur? Looking at the sentences on the left, we see some words that might occur together with eats, like cat, dog, or fish, right?
But if I take them out, and you look at the right side where we only show eats and some other words, the question then is: can you predict what other words occur to the left or to the right?
Right, so this forces us to think about what other words are associated with eats; if they are associated with eats, they tend to occur in the context of eats.
More specifically, our prediction problem is to take any text segment, which can be a sentence, a paragraph, or a document, and then ask the question: is a particular word present or absent in this segment?
Right here we ask about the word W.
Is W present or absent in this segment?
Now what's interesting is that
some words are actually easier
to predict than other words.
If you take a look at the three
words shown here, meat, the, and
unicorn, which one do you
think is easier to predict?
Now if you think about it for a moment, you might conclude that "the" is easier to predict, because it tends to occur everywhere; so I can just say, well, "the" will be in the sentence. Unicorn is also relatively easy, because unicorn is very rare, and I can bet that it doesn't occur in this sentence.
But meat is somewhere in between in terms of frequency, which makes it harder to predict: it may occur in the segment, or it may not. So now let's study this problem more formally.
So the problem can be formally defined
as predicting the value of
a binary random variable.
Here we denote it by X sub w, where w denotes a word, so this random variable is associated with precisely one word.
When the value of the variable is 1,
it means this word is present.
When it's 0, it means the word is absent.
And naturally, the probabilities for
1 and 0 should sum to 1,
because a word is either present or
absent in a segment.
There's no other choice.
So the intuition we discussed earlier can be formally stated as follows.
The more random this random variable is,
the more difficult the prediction will be.
Now the question is, how does one quantitatively measure the randomness of a random variable like X sub w? How, in general, can we quantify the randomness of a variable? That's why we need a measure called entropy; this measure was introduced in information theory to measure the randomness of X. There is also some connection with information here, but that is beyond the scope of this course.
So for our purpose, we just treat the entropy function as a function defined on a random variable.
In this case, it is a binary random
variable, although the definition can
be easily generalized to a random variable with multiple values.
Now the function form looks like this: there's a sum over all the possible values of this random variable. Inside the sum, for each value, we have the product of the probability that the random variable equals this value and the log of this probability. And note that there is also a negative sign in front.
Now entropy in general is non-negative.
And that can be mathematically proved.
So if we expand this sum, we'll see that the equation looks like the second one, where I explicitly plugged in the two values, 0 and 1. And when we have 0 times log of 0, we define that to be 0, because log of 0 is undefined.
So this is the entropy function.
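For reference, the entropy function being described can be written out as follows (a reconstruction of the slide's formula, using log base 2, which matches the fair-coin entropy of 1 later in the lecture):

    H(X_w) = - \sum_{v \in \{0, 1\}} P(X_w = v) \log_2 P(X_w = v)
           = - P(X_w = 0) \log_2 P(X_w = 0) - P(X_w = 1) \log_2 P(X_w = 1)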
And this function will give a different value for different distributions of this random variable; it clearly depends on the probability that the random variable takes the value 1 or 0.
If we plot this function against the probability that the random variable is equal to 1, then the function looks like this. At the two ends, that is, when the probability that X equals 1 is very small or very large, the entropy function has a low value; when it's 0.5, in the middle, it reaches its maximum.
Now if we plot the function against the probability that X takes the value 0, the function shows exactly the same curve, and you can imagine why: it's because the two probabilities are completely symmetric.
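To make the shape of this curve concrete, here is a minimal Python sketch (not part of the lecture; the function name binary_entropy is just an illustrative choice) that evaluates the entropy at a few values of the probability that X equals 1:

    import math

    def binary_entropy(p):
        """Entropy (in bits) of a binary random variable with P(X = 1) = p."""
        h = 0.0
        for q in (p, 1.0 - p):
            if q > 0.0:              # by convention, 0 * log(0) is treated as 0
                h -= q * math.log2(q)
        return h

    # Low near the two ends, maximum at 0.5, and symmetric in p and 1 - p:
    for p in (0.01, 0.25, 0.5, 0.75, 0.99):
        print(p, round(binary_entropy(p), 3))

This prints roughly 0.081, 0.811, 1.0, 0.811, 0.081, matching the curve: low at both ends and maximal in the middle.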
So an interesting question you can think about in general is, for what kind of X does the entropy reach its maximum or minimum? And we can in particular think about some special cases. For example, in one case, we might have a random variable that always takes the value 1, so the probability is 1. Or there's a random variable that is equally likely to take the value 1 or 0, so in this case the probability that X equals 1 is 0.5.
Now which one has a higher entropy?
It's easier to look at the problem by thinking of a simple example using coin tossing. So when we think about a random experiment like tossing a coin, it gives us a random variable that can represent the result, which can be heads or tails.
So we can define a random variable X sub coin, so that it's 1 when the coin shows up as heads and 0 when the coin shows up as tails.
So now we can compute the entropy
of this random variable.
And this entropy indicates how
difficult it is to predict the outcome
of a coin toss.
So we can think about two cases. One is a fair coin: the coin shows up as heads or tails equally likely, so both probabilities are equal to one half. The other extreme case is a completely biased coin, where the coin always shows up as heads.
Now let's think about
the entropies in the two cases.
And if you plug in these values you can
see the entropies would be as follows.
For a fair coin we see the entropy
reaches its maximum, that's 1.
For the completely biased coin,
we see it's 0.
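Plugging these probabilities into the entropy formula above (a quick check, not shown on the slide):

    Fair coin:   H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = -(-0.5 - 0.5) = 1
    Biased coin: H = -(1 \cdot \log_2 1 + 0 \cdot \log_2 0) = -(0 + 0) = 0

where the biased case uses the 0 log 0 = 0 convention mentioned earlier.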
And that intuitively makes a lot of sense.
Because a fair coin is
most difficult to predict.
Whereas a completely biased
coin is very easy to predict.
We can always say, well, it's heads, because it comes up heads all the time.
So the two cases can be shown on the curve as follows: the fair coin corresponds to the middle point, where it's very uncertain; the completely biased coin corresponds to the end point, where the probability is 1.0 and the entropy is 0.
So now let's see how we can use entropy for word prediction. Our problem is to predict whether W is present or absent in this segment.
Again, think about the three words, and in particular think about their entropies. Now we can assume that high-entropy words are harder to predict, so we now have a quantitative way to tell which word is harder to predict.
Now if you look at the three words meat, the, and unicorn again, we clearly would expect meat to have a higher entropy than the and unicorn. In fact, if you look at the entropy of "the", it's close to zero, because it occurs everywhere; so it's like a completely biased coin, and therefore its entropy is close to zero.
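As a rough illustration of how these entropies could be estimated from data, here is a small Python sketch (the segments are a made-up toy example, not the lecture's data): it treats each segment as a set of words, estimates the probability that a word is present as the fraction of segments containing it, and computes the entropy.

    import math

    def binary_entropy(p):
        # Same helper as in the earlier sketch: entropy (in bits) for P(X = 1) = p.
        h = 0.0
        for q in (p, 1.0 - p):
            if q > 0.0:              # 0 * log(0) is treated as 0
                h -= q * math.log2(q)
        return h

    # Hypothetical toy segments; real segments would be sentences, paragraphs, or documents.
    segments = [
        {"the", "cat", "eats", "meat"},
        {"the", "dog", "eats", "fish"},
        {"the", "weather", "is", "nice"},
        {"the", "program", "runs", "fast"},
    ]

    for w in ("meat", "the", "unicorn"):
        p = sum(1 for seg in segments if w in seg) / len(segments)
        print(w, "p =", p, "entropy =", round(binary_entropy(p), 3))

In this toy collection, meat gets the highest entropy, while the (which appears in every segment) and unicorn (which appears in none) both get an entropy of 0, mirroring the discussion above.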

